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Abstract. In 1866 Gregor Mendel published a seminal paper contain- 
ing the foundations of modern genetics. In 1936 Ronald Fisher pub- 
lished a statistical analysis of Mendel's data concluding that "the data 
of most, if not all, of the experiments have been falsified so as to agree 
closely with Mendel's expectations." The accusation gave rise to a con- 
troversy which has reached the present time. There are reasonable 
grounds to assume that a certain unconscious bias was systematically 
introduced in Mendel's experimentation. Based on this assumption, a 
probability model that fits Mendel's data and does not offend Fisher's 
analysis is given. This reconciliation model may well be the end of the 
Mendel-Fisher controversy. 

Key words and phrases: Genetics, ethics, chi-square tests, distribution 
of p- values, minimum distance estimates. 
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1. INTRODUCTION 

Gregor Mendel is recognized as a brilliant scientist 
and the founder of modern genetics. However, long 
ago, another eminent scientist, the statistician and 
geneticist, Sir Ronald Fisher, questioned Mendel's 
integrity claiming that Mendel's data agree better 
with his theory than expected under natural fluc- 
tuations. Fisher's conclusion is based on strong sta- 
tistical arguments and has been interpreted as an 
evidence of misconduct. A large number of papers 
about this controversy have been produced, culmi- 
nating with the publication in 2008 of a book (Frank- 
lin et al., 2008) aimed at ending the polemic and def- 
initely rehabilitating Mendel's image. However, the 
authors recognize, "the issue of the 'too good to be 
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true ' aspect of Mendel 's data found by Fisher still 
stands" . 

After submitting Mendel's data and Fisher's sta- 
tistical analysis to extensive computations and Monte 
Carlo simulations, attempting to discover a hidden 
explanation that could either confirm or refute Fisher's 
allegation, we have concluded that a statistical model 
with a simple probability mechanism can clarify the 
controversy, that is, explain Fisher's conclusions with- 
out accusing Mendel (or any assistant) of deliberate 
fraud. 

The paper is organized as follows. In Section 2 we 
summarize the history of the controversy. Then, in 
Section 3, we present a brief description of Mendel's 
experiments and of the data under consideration. In 
Section 4 we examine previous statistical analyses of 
Mendel's data, including Fisher's chi-square analy- 
sis and a meta-analysis of p-values. In Section 5 we 
present the proposed statistical model and show how 
it can explain the pending issues. The conclusions of 
this work are summed up in Section 6. 

2. A BRIEF HISTORY OF THE 
MENDEL-FISHER CONTROVERSY 

To situate the reader within the context of the 
subject matter, we first highlight the most signifi- 
cant characteristics of the two leading figures and 



2 



A. M. PIRES AND J. A. BRANCO 




(a) (b) 

Fig. 1. (a) Mendel's portrait which appeared as frontispiece 
in the book Mendel's Principles of Heredity, a Defense, Bate- 
son (1902). (b) A picture of Sir R. A. Fisher. 

review the key aspects and chronology of the con- 
troversy. 

Gregor Mendel [1822-1884, Figure 1(a)] was an 
Augustinian Austrian monk who, during at least 
seven years, performed controlled crossing experi- 
ments with the garden pea (Pisum sativum L.). He 
may have personally controlled the fertilization of 
around 29,000 plants. Based on the results of these 
experiments, he formulated the two laws, or princi- 
ples, of heredity (Mendel's first law: principle of seg- 
regation; Mendel's second law: principle of indepen- 
dent assortment). Mendel's findings were published 
in 1866 in the Proceedings of the Society of Natural 
History of Briinn, Mendel (1866). To draw his con- 
clusions, Mendel analyzed the data informally, that 
is, without using formal statistical methods, sim- 
ply because the tools he needed did not exist. Yet 
he shows a remarkable intuition for statistical con- 
cepts, being quite aware of chance, variability and 
random errors. This shows how Mendel was a man 
far ahead of his time. 

Sir Ronald Fisher [1890-1962, Figure 1(b)] made 
fundamental contributions to statistics and is also 
regarded as the founder of quantitative genetics. He 
is described by Hald (1998) as "a genius who almost 
single-handedly created the foundations for modern 
statistical science" and by Dawkins (1995) as u the 
greatest of Darwin's successors." It is thus quite un- 
derstandable that Fisher became interested in Men- 
del's work and data very early in his career. 

Let us now review the chronology of this contro- 
versy: 

1856-1863 Mendel performed his experiments dur- 
ing this period. He produced around 29,000 gar- 
den pea plants from controlled crosses and reg- 
istered several of their observable characteristics 



(phenotype), such as shape and color of the seeds, 
height, flower color, etc. 

1865 Mendel presented the results of his experiments 
in a communication entitled Experiments on Plant 
Hybridization, read at two meetings of the Society 
of Natural History of Briinn. 

1866 The paper with the same title was published 
in the proceedings of that society. The paper had 
little impact and would be cited only three times 
in the next 35 years. 

1900 His work was rediscovered independently by Hu- 
go de Vries, Carl Correns and Erich von Tschermak. 

1902 The first statistical analysis of Mendel's data is 
published in the first volume of Biometrika (Wel- 
don, 1902), using the then recently invented chi- 
square test (Pearson, 1900). 

1911 Fisher produced a first comment about Men- 
del's results, in a communication to the Cambridge 
University Eugenics Society, while he was still an 
undergraduate: 11 It is interesting that Mendel's ori- 
ginal results all fall within the limits of proba- 
ble error" and suggested that Mendel may have 
" unconsciously placed doubtful plants on the side 
which favoured his hypothesis" (Franklin et al., 
2008, page 16). 

1936 Fisher published the paper Has Mendel's work 
been rediscovered? (Fisher, 1936), where he ex- 
presses the same concern but this time presenting 
a detailed analysis, both of Mendel's experiments 
and data. He also attributes the alleged forgery, 
not to Mendel himself, but to an unknown assis- 
tant: " Although no explanation can be expected to 
be satisfactory, it remains a possibility among oth- 
ers that Mendel was deceived by some assistant 
who knew too well what was expected" (Fisher, 
1936, page 132). Fisher also questioned some other 
aspects of Mendel's experiments, but those do not 
involve statistical aspects and will not be discussed 
here. 

1964 The publication De Beer (1964), intended to 
celebrate the centennial of Mendel's article, high- 
lights the fact that Fisher "was able to reconstruct 
the sequence thread and development of Mendel 's 
series of experiments" and draws attention to Fi- 
sher's work on the statistical analysis of Mendel's 
results. Ironically, Fisher's paper appears to have 
remained mostly overlooked until approximately 
this anniversary, as far as we can tell based on the 
scarcity of previous citations. 

1964-2007 During this period at least 50 papers have 
been published about the controversy created by 
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Fisher. Some elucidative titles: The too-good-to- 
be-true paradox and Greg or Mendel (Pilgrim, 1984); 
Are Mendel's results really too close? (Edwards, 
1986a); Mud Sticks: On the Alleged Falsification 
of Mendel's Data (Hartl and Fairbanks, 2007). 
2008 A group of scientists from different fields, 
Franklin (Physics and History of Science), Ed- 
wards (Biometry and Statistics, and curiously, Fi- 
sher's last student), Fairbanks (Plant and Wildlife 
Sciences), Hartl (Biology and Genetics) and Sei- 
denfeld (Philosophy and Statistics), who have pre- 
viously published work on the controversy, merged 
their most relevant papers and published the book 
Ending the Mendel-Fisher Controversy. But is it 
really the end of the controversy? The authors dis- 
miss all of the issues raised by Fisher except the 
u too good to be true" (pages 68 and 310). 
In a very interesting book review, entitled CSI: 
Mendel, Stigler (2008) adds: ". . . an actual end to 
that discussion is unlikely to be a consequence of 
this book." and ". . . thanks to these lucid, insight- 
ful and balanced articles, another generation will 
be able to join the quest with even better under- 
standing." 

3. EXPERIMENTS AND DATA 

Before introducing the data and discussing the 
corresponding statistical analysis, it is important to 
understand the experiments and the scientific hy- 
potheses under evaluation. Using a classification sim- 
ilar to that used by Fisher, the experiments can be 
classified as follows: single trait, bifactorial, trifac- 
torial and gametic ratios experiments. 

Single trait experiments. These concern the trans- 
mission of only one binary characteristic (or trait) at 
a time. Mendel examined seven traits, two observ- 
able in the seeds (seed shape: round or wrinkled; 
seed color: yellow or green) and five in the plants 
(flower color: purple or white; pod shape: inflated or 
constricted; pod color: yellow or green; flower posi- 
tion: axial or terminal; stem length: long or short). 
First Mendel obtained what are now called "pure 
lines," with each of the two forms of the seven char- 
acters, that is, plants which yielded perfectly con- 
stant and similar offspring. When crossing the two 
pure lines, Fq, for each character Mendel observed 
that all the progeny, F\ , presented only one of the 
forms of the trait. He called this one the dominant 
form and represented it by A. The other form was 
called recessive and denoted by a. In the seven traits 
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Fig. 2. A schematic representation of Mendel's single trait 
experiments (in modern notation and terminology). 

listed above the first form is the dominant and the 
second is the recessive. He then crossed the F\ indi- 
viduals (which he called the hybrids) and observed 
that in the resulting generation, F2, there were indi- 
viduals of the two original types, approximately in 
the ratio 3 : 1 of the dominant type to the recessive 
type. In modern notation and terminology, we are 
studying a phenotype with possible values U A" and 
"a" governed by a single gene with two alleles (A 
and a, where the first is dominant). The Fq plants 
are homozygous AA (genotype AA, phenotype "A") 
or aa (genotype aa, phenotype "a"), the F\ are all 
heterozygous Aa (genotype Aa, phenotype "A"), 
the F2 plants can have genotype AA (phenotype 
U A"), genotype Aa (phenotype U A") and genotype 
aa (phenotype "a"). When Mendel self- fertilized the 
F2 plants with phenotype a A" he found that about 
one-third of these always produced phenotype U A" 
progeny, while about two-thirds produced pheno- 
type "A" and phenotype "a" progeny in the ratio 
3:1. This process is schematically represented in 
Figure 2, where (F3) refers to the progeny of the 
self- fertilized F2 individuals. 

Table 1 presents the data given in Mendel (1866) 
for the single trait experiments just described. As an 
illustration of the variability of the results between 
plants, Mendel also presented the individual figures 
obtained for the ten first plants of each of the ex- 
periments relative to the seed characteristics (these 
are referred to by Fisher as "illustrations of plant 
variation," cf. Table 5). 

Bifactorial experiment. This is an experiment sim- 
ilar to the single trait experiments but observing 
two characteristics simultaneously (seed shape, A, 
and seed color, B, starting from pure lines on both). 
The aim was to observe how the two traits are com- 
bined. Mendel postulated and confirmed from the 
results of the experiment that the traits considered 
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Table 1 

Data given in Mendel (1866) for the single trait experiments. "A" ("a") denotes the dominant (recessive) phenotype; A (a) 
denotes the dominant (recessive) allele; n is the total number of observations per experiment (that is, seeds for the seed trait 
experiments and plants otherwise); naj^« , fi" a ", hao and uaa denote observed frequencies 
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Table 2 

Data from the bifactorial experiment [as organized by Fisher 
(1936)] 
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BB 
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bb 
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67 
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132 


Total 


138 


265 


126 


529 



are assorted independently. That is, given a trait 
A with an F2 generation AA, Aa and aa in the 
ratio 1:2: 1, and a trait B with BB, Bb and bb 
in the same ratio, combining the two independently 
leads to the genotypes and theoretical ratios repre- 
sented in Figure 3. The data, organized by Fisher 
from Mendel's description, are shown in Table 2. 

Trifactorial experiment. This experiment is also 
similar to the previous experiment but considering 
the crossing of three traits (seed shape, seed color 
and flower color). The data, organized by Fisher 
from Mendel's description, are shown in Table 3, 



x This independence hypothesis is also a matter of contro- 
versy (did Mendel detect linkage?) and has been discussed 
thoroughly in the literature (see Franklin et al., 2008, pages 
288-292). 
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Fig. 3. Genotypes and theoretical ratios for the bifactorial 
experiment. 
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Fig. 4. Theoretical ratios for the trifactorial experiment. 

whereas the corresponding theoretical ratios are gi- 
ven in Figure 4. 

Gametic ratios experiments. In this last series of 
experiments Mendel designed more elaborated crosses 
in order to obtain "conclusions as regards the com- 
position of the egg and pollen cells of hybrids." The 
crosses are represented in Figure 5 and the data are 
shown in Table 4. 

We will also use an organization of the data into 84 
binomial experiments, similar to the one proposed 
by Edwards (1986a, see also Franklin et al., 2008, 
Chapter 4). The data set used is described in detail 
in Appendix A. 
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Table 3 

Data from the trifactorial experiment [as organized by Fisher (1936)] 



CC Cc cc Total 
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14 
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42 
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49 
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34 
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76 
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32 


83 


37 


152 


84 


156 


81 


321 


43 


82 


41 


166 


159 


321 
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639 



Table 4 

Data from the gametic ratios experiments (Mendel, 1866) 

Observed Theoretical Traits 
frequencies ratio A B 

1 90 20 23 25 22 1 : 1 : 1 : 1 seed shape seed color 

2 110 31 26 27 26 1:1:1:1 seed shape seed color 

3 87 25 19 22 21 1 : 1 : 1 : 1 seed shape seed color 

4 98 24 25 22 27 1 : 1 : 1 : 1 seed shape seed color 

5 166 47 40 38 41 1:1:1:1 flower color stem length 



All the computations and Monte Carlo simula- 
tions described were carried out using the R software 
(R Development Core Team, 2008). The full code is 
available upon request. 

4. STATISTICAL ANALYSIS: INCRIMINATING 
EVIDENCE 

4.1 Fisher's Analysis 

As mentioned in Section 2, Fisher (1936) presents 
a very detailed analysis of both Mendel's experi- 
ments and data. Here we will concentrate on a par- 
ticular part of the analysis, the chi-square analysis 
summarized in Table V, page 131, of Fisher (1936), 
which is reproduced in Table 5. This table has been 
the subject of a lot of debate, and it constitutes the 
main evidence for the "too good to be true" aspect of 
Mendel's data as claimed by Fisher. We later present 
a new explanation for this evidence. 

The analysis is very simple to describe: for each 
separate experiment, Fisher performed a chi-square 
goodness-of-fit test, where Hq specifies the probabil- 
ities implied by the theoretical ratios. Note that, for 
the two category cases, this is equivalent to the usual 
asymptotic test for a single proportion. Then he ag- 
gregated all the tests by summing the chi-square 
statistics as well as the associated number of degrees 
of freedom and computed an aggregated p-value of 



Table 5 

Fisher's chi-square analysis ("Deviations expected and 
observed in all experiments") 



Probability 



Experiments 


Expectation 


x 2 


of exceeding 
deviations 
observed 


3 : 1 ratios 
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2.1389 


0.95 


2 : 1 ratios 
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5.1733 


0.74 


Bifactorial 


8 


2.8110 


0.94 


Gametic ratios 


15 


3.6730 


0.9987 


Trifactorial 


26 


15.3224 


0.95 


Total 


64 


29.1186 


0.99987 


Illustrations of 








plant variation 


20 


12.4870 


0.90 


Total 


84 


41.6056 


0.99993 



0.99993. This would mean that if Mendel's exper- 
iments were repeated, under ideal conditions such 
that all the null-hypotheses are true, and all the 
Bernoulli trials — within and between experiments — 
are independent, the probability of getting an overall 
better result would be 7/100,000. 

Fisher's chi-square results were recomputed just 
to confirm that we are working with exactly the 
same data and assumptions. The results, given in 
the first 4 columns of Table 6, show that the statis- 
tics (Xobs) are identical to Fisher's values, but there 
are some differences in the p- values which certainly 
reflect different methods of computing the chi-square 
distribution function [p- value (Xdf ) denotes the p- value 
computed from a x 2 distribution with df degrees of 
freedom, that is, P(xl f > xl hs )}- 

The table also gives the Monte Carlo (MC) esti- 
mates of the p-values (and corresponding standard 
errors, se) based on 1,000,000 random repetitions of 
the experiments using binomial or multinomial sam- 
pling, whichever is appropriate, as considered by Fi- 
sher. A more detailed description of the Monte Carlo 
simulation is given in Appendix B.l. 
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Comparing the list of Xdf p-values with the list of 
MC p- values, the conclusion is that the approxima- 
tion of the sampling distribution of the test statis- 
tic by the chi-square distribution is very accurate, 
and that Fisher's analysis is very solid [our results 
are also in accordance with the results of similar 
but less extensive simulations described in Novitski 
(1995)]. Moreover, we can also conclude that the evi- 
dence "against" Mendel is greater than that given by 
Fisher, since an estimate of the probability of getting 
an overall better result is now 2/100,000. We have 
also repeated the chi-square analysis considering the 
84 binomial experiments (results given in the next 
3 columns of Table 6) and concluded that the two 
sampling models are almost equivalent, the second 
one (only binomial) being slightly more favorable 
to Fisher and less favorable to Mendel. Acting in 
Mendel's defense, the results will be more convinc- 
ing if we prove our case under the least favorable 
scenario. Thus, for the remaining investigation, we 
use only the binomial model and data set. 

Franklin et al. (2008, pages 29-67) provide a com- 
prehensive systematic review of all the papers pub- 
lished since the 1960s in reaction to Fisher's ac- 
cusations. The vast majority of those authors try 
to put forward arguments in Mendel's defense. We 
only highlight here some of the more relevant con- 
tributions regarding specifically the "too good to be 
true" conclusion obtained from the chi-square anal- 
ysis. The majority of the reactions/arguments can 
be generically classified into three categories. 

In the first category we consider those who do not 
believe in Fisher's analysis. This is the case of Pil- 



grim (1984, 1986) who in the first paper affirms to 
have detected "four paradoxical elements in Fisher's 
reasoning" and who, in the second, claims to have 
been able to show where Fisher went wrong. Pil- 
grim's arguments are related to the application of 
the chi-square global statistic and were refuted by 
Edwards (1986b). 

As a second category, those who, in spite of believ- 
ing that Fisher's analysis is correct, think it is too 
demanding and propose alternative ways to analyze 
Mendel's data. Edwards (1986a) analyzes the distri- 
bution of a set of test statistics, whereas Seidenfeld 
(1998) analyzes the distribution of a set of p-values. 
They both find the "too good to be true" charac- 
teristic and come to the conclusion that Mendel's 
results were adjusted rather than censored. 2 The 
methods of Leonard (1977) and Robertson (1978), 
who analyzed only a small part of the data, could 
also be classified here, but, according to Piegorsch 
(1983), their contribution to advance the debate was 
marginal. 

Finally, as a third category, those who believe 
Fisher's analysis is correct under its assumptions 
(binomial/multinomial sampling, independent exper- 
iments) and then try to find a reason or explanation, 
other than deliberate cheating, for the observation 
of a very high global p-value. Such an explanation 



2 These are the precise words used in the cited references 
(Edwards, 1986b and Seidenfeld, 1998). They mean that some 
results have been slightly modified to fit Mendel's expecta- 
tions ("adjusted"), instead of just being eliminated ("cen- 
sored" or "truncated"). 
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Fig. 5. Schematic representation of the gametic ratios experiments: (a) experiments 1~4 in Table 4', (b) experiment 5 in 
Table 4. 



Table 6 

Results of the chi-square analysis considering different models and methods for computing /estimating p-values. Each line corresponds to a different type of 
experiment in Mendel's paper: single trait, 3 : 1 ratios; single trait, 2 : 1 ratios; bifactorial (BF); gametic ratios (GR); trifactorial (TF); and illustrations of plant 

variation (PV). df : degrees of freedom of the asymptotic distribution of the x 2 test statistic under Ho (Mendel's theory); Xoba-' observed value of the \ 2 test 
statistic; p-value (Xdt) : p-value computed assuming that the test statistic follows, under Ho, a x 2 distribution with df degrees of freedom; p-value (MC): p-value 

estimated from Monte Carlo simulation, se: standard error of the p-value (MC) estimate 
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0.9733 


0.8348 
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has to imply the failure of at least one of those two 
assumptions. Moreover, that failure has to occur in 
a specific direction, the one which would reduce the 
chi-square statistics: for instance, the distribution 
of the phenotypes is not binomial and has a smaller 
variance than the binomial. The various explana- 
tions that have been put forward can be divided 
into the following: biological, statistical and metho- 
dological. 

Among the biological candidate explanations, one 
that received some attention was the "Tetrad Pollen 
Model" (see Fairbanks and Schaalje, 2007). 

Few purely statistical explanations have been pro- 
posed and most of them are anecdotal. One that 
raised some discussions was a proposal of Weiling 
(1986) who considers, based on the tetrad pollen 
model just mentioned, a distribution with smaller 
variance than the binomial for some of the experi- 
ments, and hypergeometric for other experiments. 

The majority of the suggested explanations are 
of a methodological nature: the "anonymous" assis- 
tant (Fisher); sequential procedures, like stopping 
the count when the results look good (several au- 
thors); discard plants or complete experiments due 
to suspicions of some experimental error, like pollen 
contamination (Dobzhansky, 1967); luck(?); inher- 
ent difficulties in the classification of the pheno- 
types (Root-Bernstein, 1983); data selection for pre- 
sentation (Di Trocchio, 1991; Fairbanks and Ryt- 
ting, 2001). 

It is important to keep in mind that for an expla- 
nation to be acceptable as the solution to the con- 
troversy it must fulfill a number of conditions: (i) 
it must be biologically plausible and/or experimen- 
tally verifiable; (ii) it must be statistically correct 
and pass the chi-square and eventually other statis- 
tical analyses aiming at disentangling the enigma; 
and (iii) assuming that Mendel's theory is correct 
and that he is not guilty of any deliberate fraud, 
it has to find support in and it can not contradict 
Mendel's writings. The fact is that all the explana- 
tions which were proposed up to now failed in one 
or other of these requirements. 

In summary, Fisher's analysis has resisted all at- 
tempts to be either refuted or explained. Our sim- 
ulation also confirms that, under the standard as- 
sumptions, Fisher's tests and conclusions are cor- 
rect. 

4.2 Analysis of p-Values 

As mentioned in Section 3, Edwards (1986a) pro- 
posed an organization of the data into 84 binomial 



experiments. He then used the data to compute what 
he called (signed) x values, that is, the square root 
of the chi-square statistic with the sign of the devi- 
ation ("+" if observed > expected and "— " if ob- 
served < expected). Since all the tests have one de- 
gree of freedom and, assuming that Mendel's theory 
is correct, the x values should follow approximately 
a standard normal distribution. However, a normal 
qqplot of those values shows apparently a large de- 
viation from normality (Franklin et al., 2008, Fig- 
ures 1.1 and 1.2, page 49). From the shape of the 
plot Edwards (1986a) concluded that it appears to 
be more likely that Mendel's results were adjusted 
rather than truncated. This conclusion, to which 
Seidenfeld (1998) also arrives, and later Franklin et 
al. (2008) agree, would render some of the most plau- 
sible methodological explanations not viable. 

Another approach is to analyze the p- values of the 
individual x\ tests. This idea was explored by Sei- 
denfeld (1998); (see also Franklin et al., 2008, Fig- 
ures 1.3 and 1.4, page 59), although not so system- 
atically as in the analysis provided here. The 84 % 
values, along with the 84 p-values, are also given in 
Appendix A. 

As for Fisher's and Edwards' analysis, we know 
what to expect under the ideal assumptions. That 
is, if: (i) Mendel's theory is valid for all the experi- 
ments, or, equivalently, if the null hypotheses of the 
chi-square tests are true in all cases; (ii) the ex- 
periments were performed independently and as de- 
scribed in Mendel's paper; and (iii) the chi-square 
approximation is valid, then the p-values follow a 
uniform (0, 1) distribution. Therefore, the plot of the 
empirical cumulative distribution function (e.c.d.f.) 3 
of the p- values should be close to the diagonal of the 
(0,1) x (0,1) square. However, the e.c.d.f., plotted 
in Figure 6, reveals a marked difference from uni- 
formity. This visual assertion was confirmed by a 
Kolmogorov-Smirnov (K-S) goodness-of-fit test (p- 
value = 0.0036, details in Appendix B.2). We can 
therefore conclude with a high confidence that the 
distribution of the p- values deviates from a uniform 
(0, 1) distribution. It is then natural to wonder about 
the kind of deviation and its meaning. In Figure 6 
we also plot the cumulative distribution function 
(c.d.f.) of the maximum of two uniform (0, 1) ran- 
dom variables, y = x 2 , since this is central to the 
explanation that we give later for Mendel's results. 



3 The e.c.d.f. is defined, for a random sample of size n, as 
F„(x) = {n. of observations < x}/n. 
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Fig. 6. Empirical cumulative distribution function of the 
p-values (stair steps line); cumulative distribution function of 
the uniform (0,1) random variable (straight line); cumulative 
distribution function of the maximum of two (0, 1) uniform 
random variables (curve). 
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Fig. 7. Histogram of the 84 p-values observed [the dashed 
line indicates the expected frequencies under the uniform (0, 1) 
distribution]. 

The histogram of the p-values (Figure 7) is help- 
ful for our argumentation. One could perhaps think 
that the uniform distribution is not a good fit for 
the sample of p-values because some of the null hy- 
potheses are not true. But if that were the case, we 
would observe an excess of values close to 0, and 
the histogram shows precisely the opposite. Possi- 
ble reasons for this to happen are as follows: either 
the data shows that the hypotheses are "more true," 
that is, the data are better than expected under the 
null hypotheses, or there is something wrong with 
the assumptions (and possible explanations are, as 
for the chi-square analysis, smaller variance than bi- 
nomial, or lack of independence). 



In conclusion, the p-value analysis shows that the 
probability of obtaining overall results as good or 
better than those obtained by Mendel (under the 
assumptions) is about 4/1000. This "evidence" is 
not as extreme as the 2/100,000 resulting from the 
chi-square analysis but points in the same direction. 

5. A PLAUSIBLE EXPLANATION 
5.1 A Statistical Model for the p- Values 

In the previous section we have shown that there 
is strong evidence that the p-values are not uni- 
formly distributed. What is then their distribution, 
and how can it be explained? 

The shape of the e.c.d.f. provides a hint: it re- 
sembles the function x 2 , which is the c.d.f. of the 
maximum of two uniform (0,1) random variables, 
as can easily be shown. It appears that some c.d.f. 
intermediate between that corresponding to a uni- 
form (0, 1) random variable and that corresponding 
to the maximum of two uniform (0,1) random vari- 
ables best fits the e.c.d.f. of the sample p-values (see 
Figure 6). 

One explanation for this is the following: suppose 
Mendel has repeated some experiments, presumably 
those which deviate most from his theory, and re- 
ports only the best of the two. A related possibil- 
ity was suggested by Fairbanks and Rytting (2001, 
page 743): "We believe that the most likely explana- 
tion of the bias in Mendel's data is also the simplest. 
If Mendel selected for presentation a subset of his ex- 
periments that best represented his theories, x 2 anal- 
ysis of those experiments should display a bias." The 
authors support this explanation with citations from 
Mendel's work. We have found only an attempt to 
verify the effect of such a selection procedure on the 
chi-square analysis (footnote number 62, page 73, of 
Franklin et al., 2008), but it seems to lead to the 
wrong conclusion, as we conclude later that the ef- 
fect of the given explanation on the chi-square anal- 
ysis is very small. Moreover, the explanation appears 
to have been abandoned because "It does not [how- 
ever] address the demonstration, by both Edwards 
and Seidenfeld, that Mendel's data had not merely 
been truncated, but adjusted" (Franklin et al., 2008, 
page 62). 

Both procedures described in the previous para- 
graph for selecting the data to be presented can 
be modeled by assuming that an experiment is re- 
peated whenever its p-value is smaller than a, where 
< a < 1 is a parameter fixed by the experimenter, 
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and then only the one with the largest p-value is 
reported. Under this selection model (from now on 
named "model A"), the c.d.f. of the p- values of the 
experiments reported is given by 



estimating a. This estimate is the value of a which 
minimizes the K-S statistic, 



(5.1) F a (x) 



x , 

(1 + a)x 



a, 



if < x < a, 
if a < x < 1. 



Proof. For a given experiment, denote by X 
the p-value effectively reported. We have that X = 
X\ , if X\ > a and X = max(Xi , X2) if X\ < a, where 
X\ and X2 represent the p-values obtained in the 
first and the second realization of the experiment (if 
there is one), respectively. Assume that X\ and X2 
are independent and identically distributed continu- 
ous uniform (0, 1) random variables (i.e., the two re- 
alizations of the experiment are independent and the 
associated null hypothesis is true), that is, P{X\ < 
x) = P{X2 < x) = x, < x < 1. In the derivation of 
F a (x) = P(X < x), the cases < x < a and a < x < 
1 are considered separately. 

If < x < a, 

P(X <x) = P(max(X 1 ,X 2 ) < x) 

= P{{X 1 < x} n {X 2 < x}) 
= P(Xi < x)P(X 2 <x) = x 2 . 

If ct<x< 1, 
P(X < x) 

= p(({x <x}n{x 1 <«}) 
u({x<x}n{x 1 >q})) 

= P({X<x}n{Xi <a}) 

+ P({X <x}D{X 1 >a}) 
= P({max(X 1 ,X 2 ) <x}D {X 1 < a}) 

+ P({Xi < x} n {X Y > a}) 

= P({X 1 < a} n {X 2 < x}) + P{a <X 1 <x) 

= x x a + (x — a). □ 

Suppose that model A holds but a is unknown and 
must be estimated using the available sample of 84 
binomial p- values. The minimum distance estimator 
based on the Kolmogorov distance, also called the 
"Minimum Kolmogorov-Smirnov test statistic esti- 
mator" (Easterling, 1976), provides one method for 



(5.2) 



D(a) = sup \F n 



F a (x)\ 



for testing the null hypothesis that the c.d.f. of the 
p- values is F a . Equivalently, the estimate can be de- 
termined by finding the value of a which maximizes 
the p-value of the K-S test, p(a), since p(-) is a 
strictly decreasing function of D(-). 

Figure 8 shows the plot of the K-S p- values, p(a), 
as a function of a, together with the point esti- 
mate, a = 0.201 (D = 0.0623, P = 0.8804), and a 




Fig. 8. Plot of the p-value of the K-S test as a function 
of the parameter, showing the point estimate and the 90% 
confidence interval, for model A. 




4 Note that this is just an idealized model on which to base 
our explanation. We are not suggesting that Mendel actually 
computed p-values! 



Fig. 9. Empirical cumulative distribution function of the 
p-values and fitted model (solid line: a — 0.201; dashed lines: 
90% confidence limits). 
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Fig. 10. Plot of the p-value of the K-S test as a function 
of the parameter, showing the point estimate and the 90% 
confidence interval, for model B. 

90% confidence interval for a, (0.094; 0.362). A de- 
tailed explanation on how these figures were ob- 
tained is given in Appendix B.3. Figure 9 confirms 
the good model fit. 

This model can also be submitted to Fisher's chi- 
square analysis. Assuming it holds for a certain value 
«0i we may still compute "chi-square statistics," but 
the p- values can no longer be obtained from the chi- 
square distribution. However, they can be accurately 
estimated by Monte Carlo simulation. The differ- 
ence to the previous simulations is that statistics 
and (x1) p-values were always computed (for each 
of the 84 binomial cases and each random repeti- 
tion) and whenever that p-value was smaller than 
ao another binomial result was generated and the 
statistic recorded was the minimum of the two. 

The simulation results obtained for three values 
of a (point estimate and limits of the confidence 
interval) are presented in the three columns of Ta- 
ble 6 under the heading "Model A." The p-values 
in these columns (and especially those correspond- 
ing to a = 0.201) do not show any sign of being too 
close to one anymore, in fact, they are perfectly rea- 
sonable. In Appendix C we present a more detailed 
(and technical) justification of the results obtained. 

The conclusion is that our model explains Fisher's 
chi-square results: Mendel's data are "too good to 
be true" according to the assumption that all the 
data presented in Mendel's paper correspond to all 
the experiments Mendel performed, or to a random 
selection from all the experiments. When this as- 
sumption is replaced by model A the results can no 



longer be considered too good. So we conclude that 
model A is a reasonable statistical explanation for 
the controversy. We do not pretend that it is neces- 
sarily the "true" model; however, it is very simple 
and does provide extra insight into the complexity of 
this historical debate and in this sense it is useful. 
As G.E.P. Box said, "All models are wrong, some 
models are useful." 

We have just seen how the suggested selection 
mechanism can make Mendel's results (which we 
know are in fact correct) look too correct. This raises 
a related question of general interest to all experi- 
mental sciences: is it possible to make an incorrect 
theory look correct by applying this or a similar se- 
lection mechanism? Although a detailed answer to 
this question is beyond the scope of this paper, in 
Appendix D we give an idea on how a generalization 
of model A can be used to explore the question. 

5.2 Alternative Models 

No doubt there are many models, perhaps more 
complicated than ours, that explain Mendel's data 
as well as, or perhaps better than, ours. A relevant 
question to ask, then, is whether any model similar 
to ours, more specifically, a one parameter model 
with c.d.f. varying between x and x 2 , would produce 
similar results and also a reasonable interpretation. 

To show that the answer to this question is nega- 
tive, we have considered an alternative model, model 
B, with distribution function computed lin- 
ear combination of the "extreme" models, that is, 
with c.d.f. given by Fp(x) = (1 — (3) x x + (3 x x 2 , 
with < /3 < 1. This is mathematically simpler than 
model A and its interpretation in terms of the de- 
sign of the experiments could be: Mendel would also 
decide to repeat some experiments and report only 
the best result of both (the original and the rep- 
etition), but the decision to repeat would be taken 
randomly with probability (3, for instance, by throw- 
ing a fair coin (/3 = 0.5) or something similar. Ap- 
plying the methods described in Appendix B.3 to 
this model, we obtain (see Figure 10) $ = 0.45 (K- 
S test: D = 0.0875, P = 0.5131) and CI 90% ((3) = 
(0.261; 0.634). Figure 11 shows the e. c.d.f. of the p- 
values and the c.d.f. of model B with (3 = 0.45 (solid 
line) and (3 = 0.261, 0.634 (dashed lines). Compared 
to Figure 9, the fit of model B looks worse than the 
fit of model A, but it could still be considered ac- 
ceptable. However, in what concerns the chi-square 
analysis, model B is unable to produce good results 
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X 

Fig. 11. E.c.d.f. of the p-values and alternative model (solid 
line: (3 = 0.45; dashed lines: 90% confidence limits). 

(cf. the last three columns in Table 6). The aggre- 
gated p- value (84 df) still points to "too good to be 
true" except maybe for the last column, which cor- 
responds to the odd situation of randomly repeating 
about 60% of the experiments! We have presented 
model B as just an exercise to show a specific point, 
it does not correspond to a plausible procedure as 
does model A. 

5.3 Further Support for the Proposed Model 

As a harder challenge, we observed the behavior 
of each of the models in the context of Edwards' 
chi values analysis, mentioned at the beginning of 
Section 4.2. The results of a simple simulation exer- 
cise are represented in the four plots in Figure 12. 
All the plots are normal quantile-quantile plots, and 
contain the representation of the actual sample of 84 
X values (thick line). Each plot also represents 100 
samples of simulated 84 x values, generated by the 
corresponding model (gray thin lines), plus a "syn- 
thetic" sample obtained by averaging the ordered 
observations of those 100 simulated samples (inter- 
mediate line). Plot (a): the samples were generated 
from a standard normal random variable (i.e., from 
the asymptotic distribution of the x values under 
the ideal assumptions, binomial sampling and in- 
dependent experiments). Plot (b): in this case the 
X values were obtained (by transformation of the 
X 2 values) from the first 100 samples used to ob- 
tain the results given in the columns with heading 
"Edwards" in Table 6. Plot (c): similar to the pre- 
vious but with the samples generated under model 
A. Plot (d): idem with model B. 



From the top plots we conclude that the Normal 
and the Binomial models are very similar and do 
not explain the observed values, whereas from the 
bottom plots we can see that model A provides a 
much better explanation of the x values observed 
than model B. These conclusions are no longer sur- 
prising, in the face of the previous evidence; how- 
ever, we shall remark that the several analyses are 
not exactly equivalent, so the previous conclusions 
would not necessarily imply this last one. 

Besides the statistical evidence, which by itself 
may look speculative, the proposed model is sup- 
ported by Mendel's own words. The following quo- 
tations from Mendel's paper (Mendel, 1866, page 
numbers from Franklin et al., 2008) are all relevant 
to our interpretation: 

u it appears to be necessary that all mem- 
bers of the series developed in each succes- 
sive generation should be, without excep- 
tion, subjected to observation" (page 80). 

From this sentence we conclude that Mendel was 
aware of the potential bias due to incomplete ob- 
servation, thus, it does not seem reasonable that he 
would have deliberately censored the data or used 
sequential sampling as suggested by some authors: 

"As extremes in the distribution of the two 
seed characters in one plant, there were 
observed in Expt. 1 an instance of 43 round 
and only 2 angular, and another of 14 round 
and 15 angular seeds. In Expt. 2 there was 
a case of 32 yellow and only 1 green seed, 
but also one of 20 yellow and 19 green" 
(page 86). 

"Experiment 5, which shows the greatest 
departure, was repeated, and then in lieu 
of the ratio of 60:40, that of 65:35 re- 
sulted" (page 89). 

Here he mentions repetition of an experiment but 
gives both results (note that he decided to repeat an 
experiment with p- value = 0.157). However, later he 
mentions several further experiments (pages 94, 95, 
99, 100, 113) but presents results in only one case 
(page 99) and in another suggests that the results 
were not good (page 95): 

u In addition, further experiments were made 
with a smaller number of experimental plants 
in which the remaining characters by twos 
and threes were united as hybrids: all yielded 
approximately the same results" (page 94). 
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Sample quantiles Sample quantiles 

Fig. 12. Each part of the figure contains a normal quantile-quantile plot of the original 84 Edwards' \ values (solid thick 
line), and of 100 samples of simulated 84 X values from each model (gray thin lines) as well as an intermediate line, located 
in the "middle" of the gray lines, corresponding to a "synthetic" sample obtained by averaging the ordered observations of the 
100 simulated samples. 



"An experiment with peduncles of differ- 
ent lengths gave on the whole a fairly sat- 
isfactory results, although the differentia- 
tion and serial arrangement of the forms 
could not be effected with that certainty 
which is indispensable for correct exper- 
iment" (page 95). 

"In a further experiment the characters of 
flower- color and length of stem were ex- 
perimented upon..." in this case results 
are given, and then concludes "The the- 
ory adduced is therefore satisfactorily con- 
firmed in this experiment also" (pages 99/100). 

"For the characters of form of pod, color 
of pod, and position of flowers, experiments 
were also made on a small scale and re- 
sults obtained in perfect agreement'''' (page 100). 

"Experiments which in this connection were 
carried out with two species of Pisum . . . 



The two experimental plants differed in 5 
characters, ..." (page 113). 

It is likely that the results omitted were worse and 
that Mendel may have thought there would be no 
point in showing them anymore (he gave examples 
of bad fit to his theory before, page 86). 

Our model may be seen as an approximation for 
the omissions described by Mendel. In conclusion, an 
unconscious bias may have been behind the whole 
process of experimentation and if that is accepted, 
then it explains the paradox and ends the contro- 
versy at last. 

6. CONCLUSION 

Gregor Mendel is considered by many a creative 
genius and incontestably the founder of genetics. 
However, as with many revolutionary ideas, his laws 
of heredity (Mendel, 1866), a brilliant and impres- 
sive achievement of the human mind, were not im- 
mediately recognized, and stayed dormant for about 
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35 years, until they were rediscovered in 1900. When 
Ronald Fisher, famous statistician and geneticist, 
considered the father of modern statistics, used a 
chi-square test to examine meticulously the data 
that Mendel had provided in his classical paper to 
prove his theory, he concluded that the data was 
too close to what Mendel was expecting, suggest- 
ing that scientific misconduct had been present in 
one way or another. This profound conflict raised 
a longstanding controversy that has been engaging 
the scientific community for almost a century. Since 
none of the proposed explanations of the conflict is 
satisfactory, a large number of arguments, ideas and 
opinions of various nature (biological, psychological, 
philosophical, historical, statistical and others) have 
been continually put forward just like a never ending 
saga. 

This study relies on the particular assumption that 
the experimentation leading to the data analyzed by 
Fisher was carried out under a specific unconscious 
bias. The argument of unconscious bias has been 
considered a conceivable justification by various au- 
thors who have committed themselves to study some 
variations of this line of reasoning (Root-Bernstein, 
1983; Bowler, 1989; Dobzhansky, 1967; Olby, 1984; 
Meijer, 1983; Rosenthal, 1976; Thagard, 1988; Nis- 
sani and Hoefler-Nissani, 1992). But all these at- 
tempts are based on somehow subjective interpreta- 
tions and throw no definite light on the problem. On 
the contrary, in this paper the type of unconscious 
bias is clearly identified and a well-defined statis- 
tical analysis based on a proper statistical model 
is performed. The results show that the model is a 
plausible statistical explanation for the controversy. 

The study goes as follows: (i) Fisher results were 
confirmed by repeating his analysis on the same real 
set of data and on simulated data, (ii) inspired by 
Edwards' (1986a) approach, we next idealized a con- 
venient model of a sequence of binomial experiments 
and recognized that the p-value produced by this 
model shows a slight increase, although it keeps very 
close to the result obtained by Fisher. This gave 
us confidence to work with this advantageous struc- 
ture, (iii) we focused on the analysis of the p-values 
of the previous model and realized that the p- values 
do not have a uniform distribution as they should, 
(iv) the question arose of what the distribution of 
the p-values could be, and we arrived at the sat- 
isfactory model we propose in the text, (v) finally, 
assuming that our model holds, and repeating the 



chi-square analysis adopted by Fisher, one sees that 
the impressive effect detected by Fisher disappears. 

Returning to Fisher's reaction to the paradoxical 
situation he encountered, one may think that, de- 
spite his remarkable investigation (Fisher, 1936) of 
Mendel's work, to prove that something had gone 
wrong with the selection of the experimental data, 
apparently neither did he question how could the 
data have been generated nor did he identify the 
defects of the sample or give a statistical explana- 
tion for the awkward result. In the end Fisher left 
an inescapable global impression of scientific mal- 
practice, a conclusion that he based on a sound sta- 
tistical analysis. 

Probity is an essential component of the scientific 
work that should always be contemplated to guar- 
antee credible final results and conclusions. That is 
why all measures should be taken to make sure that 
neither conscious nor unconscious bias will affect the 
results of the research work. Unfortunately there ex- 
ists unconscious bias, an intrinsic automatic human 
drive based on culture, social prejudice or motiva- 
tion that is difficult to stop. Hidden bias influences 
many aspects of our decisions, our social behaviour 
and our work. That is why scientific enterprises in- 
cluding honourable doctors and well-intentioned pa- 
tients do not dispense the scientific techniques based 
on blind or double blind procedures. In Mendel's 
case we all know that there was a profound mo- 
tivation that could have triggered the bias and in 
those days we guess that the attention given to un- 
conscious bias may have been poor or it may have 
not existed at all. Frequently science ends up in de- 
tecting errors or fraud that have been induced by 
bias. But there are no errors in Mendel's laws, or 
are there? So why are we worried? Anyway, we wish 
that Mendel's unconscious bias coincides with the 
arrangement we are suggesting in this paper, be- 
cause if Mendel did what we think he did, the con- 
troversy is finally over. 

APPENDIX A: THE 84 BINOMIAL 
EXPERIMENTS 

Edwards (1986a) organized Mendel's data as the 
result of 84 binomial experiments. Note that this 
involves decomposition of the multinomial experi- 
ments. In this study we have relied on Edwards' de- 
compositions. In order to remain as close as possible 
to Fisher's choices, the data from Table 1 — already 
binomial experiments — were included exactly as 
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shown, unlike Edwards who subtracted the "plant il- 
lustrations" from these data. According to this pro- 
cedure, experiment No. 1 (No. 2) is not indepen- 
dent of the "plant illustrations" Nos 8-17 (Nos 18- 
27). But, attending to the relative magnitude of the 
number of observations, if we had used Edwards' 
numbers, the final results would have not been too 
different and the conclusions would have been the 
same. We have also considered the theoretical ratio 
of 2 : 1 throughout the experiments involving (F3) 
generations, instead of the ratio 0.63 : 0.37 that Ed- 
wards used in some cases. The number of binomial 
experiments per pair of true probabilities (ratio) is 
as follows: 42 cases with 0.75 : 0.25 (3 : 1); 15 cases 
with 0.5 : 0.5 (1 : 1); 27 cases with 2/3 : 1/3 (2 : 1). 

Table 7 contains the following information about 
the 84 binomial experiments considered in this pa- 
per: 

Trait: binary variable under consideration (the cat- 
egory of interest is called a "success" and the other 
category is a "failure") using the following cod- 
ing: A (seed shape, round or wrinkled), B (seed 
color, yellow or green), C (flower color, purple or 
white), D (pod shape, inflated or constricted), E 
(pod color, yellow or green), F (flower position, 
axial or terminal), G (stem length, long or short). 
The usual notation is used to distinguish pheno- 
type (italic inside quotation marks) from genotype 
(italic), and the dominant form (upper case) from 
the recessive (lowercase); see also Section 3 and 
Table 1. 

n: number of observations (Bernoulli trials) of the 
experiment. 

Observed: observed frequencies of "successes" (ni) 
and "failures" (n — n\). Under the standard as- 
sumptions n\ ~ Bin(n,p), where p is the proba- 
bility of a "success" in one trial. 

Po : theoretical probability of a "success" under 
Mendel's theory (H^-.p = po). 

X- observed value of the test statistic to test Hq 
a gainst H 1 : p 7^ po, given by {n\ — npo)/ 
y/npo(l-p ). 

j>-value: p-value of the test. Assuming n is large, 
p-value = P(xi > X 2 )- 

APPENDIX B: TECHNICAL DETAILS 

B.l Simulation of the Chi-Square Analysis 

In each of the 1,000,000 repetitions a replicate 
of Mendel's complete data set was generated, us- 
ing the probabilities corresponding to the theoret- 
ical ratios, and multinomial distributions with the 



appropriate number of categories (which reduces to 
the binomial distribution for the experiments with 
two categories and is strictly multinomial for the 
remaining, bifactorial, trifactorial and gametic ra- 
tios). For each replicate, a total "chi-square" statis- 
tic was computed as Fisher did for the actual data 
set. From the 1,000,000 replicates of the test statis- 
tic it is possible to estimate the p-value of the test 
without knowledge of the sampling distribution of 
the test statistic. Recall that the MC estimate of a 
p-value (or simulated p-value) associated to a cer- 
tain observed statistic (which increases as the data 
deviate from the null hypothesis) is the number of 
repetitions for which the corresponding simulated 
statistic is larger than the observed statistic (Xobs)' 
divided by the number of repetitions. If we denote 
an MC estimate of a p-value by P, the correspond- 
ing estimated standard error is se = \/P(l — P)/B, 
where B is the number of random repetitions. These 
figures are also reported in Table 6. 

B.2 Analysis of the p-Values 

The Kolmogorov-Smirnov (K-S) test is a goodness- 
of-fit test based on the statistic D = sup^. \F n {x) — 
Fq(x)\, where F n {x) is the e.c.d.f. obtained from a 
random sample (x±, . . . , x n ) and Fq(x) is a hypoth- 
esized, completely specified, c.d.f. [D is simply the 
largest vertical distance between the plots of F n (x) 
and Fq{x)\. This test was selected for analyzing the 
c.d.f. of the p-values because it is more powerful 
for detecting deviations from a continuous distribu- 
tion than other alternatives such as the chi-square 
goodness-of-fit test (Massey, 1951). Under the ap- 
propriate conditions [Fq(x) is continuous, there are 
no ties in the sample] , the exact p- value of the K-S 
test can be computed. In our analysis these con- 
ditions are not exactly met (the true c.d.f. is not 
continuous and because of that there are ties in the 
data), so it is necessary to proceed with caution. 

The first K-S test performed intended to test the 
uniformity of the 84 p- values and produced D = 
0.1913 (P = 0.0036). The "exact" p-value was com- 
puted after eliminating the ties by addition of a 
small amount of noise to each data point (random 
numbers generated from a normal distribution with 
zero mean and standard deviation 10~ 7 ). 

As there are several approximations involved, we 
checked the whole procedure by performing a sim- 
ulation study similar to the one described in Sec- 
tion 4.1 for the chi-square analysis. In 1,000,000 ran- 
dom repetitions of the sample of 84 p- values a simu- 
lated p-value of 0.0038 (se = 0.00006) was obtained 
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Table 7 

Data from the 84 binomial experiments 



Type of Observed 



experiment 


INo. 


Trait 


n 


ni 


n — ni 


Po 


X 


p- value 


Single trait 


1 


A 




7324 


5474 


1850 


3/4 


-0.513 


0.608 




2 


B 




8023 


6022 


2001 


3/4 


0.123 


0.903 




3 


C 




929 


705 


224 


3/4 


0.625 


0.532 




4 


D 




1181 


882 


299 


3/4 


-0.252 


0.801 




5 


E 




580 


428 


152 


3/4 


-0.671 


0.502 




6 


F 




858 


651 


207 


3/4 


0.591 


0.554 




7 


G 




1064 


787 


277 


3/4 


-0.779 


0.436 


Illustrations 





A 




i — » 

O ( 


45 


12 


O 1 A 

3/4 


0.688 


0.491 


of plant 


9 


A 




35 


27 


8 


3/4 


0.293 


0.770 


variation 




A 




31 


24 


1 


3/4 


noil 
0.311 


0. /ob 




i i 
11 


A 




20 


19 


10 


O 1 A 

3/4 


— 1.1(9 


A OOO 
0.208 




12 


A 




43 


32 


11 


3/4 


—0.088 


0.930 




id 


A 




32 


26 


a 



O 1 A 

3/4 


0.81 1 


A A 1 A 
U.414 




1 A 


A 




112 


oo 


24 


"\IA 
o/i 


87^i 


u.ooo 




15 


A 




32 


22 


10 


3/4 


-0.817 


0.414 




16 


A 




34 


28 


6 


3/4 

/ 


0.990 


0.322 




17 


A 

A V 




32 


25 


7 


3/4 


0.408 


0.683 




18 


R 
13 




36 


25 


11 


3/4 


-0.770 


0.441 




19 


R 
13 




39 


32 


7 


3/4 


1.017 


0.309 




20 


D 
J_) 




19 


14 


5 


3/4 


-0.133 


0.895 




21 


D 
13 




97 


70 


27 


3/4 


-0.645 


0.519 




22 


D 
13 




37 


24 


13 


3/4 


-1.424 


0.155 




23 


R 

13 




26 


20 


6 


3/4 


0.227 


0.821 




21 


R 
13 




45 


32 


13 


3/4 


-0.603 


0.547 




25 


R 
13 




53 


44 


9 


3/4 


1.348 


0.178 




26 


13 




64 


50 


14 


3/4 


0.577 


0.564 




27 


R 
13 




62 


44 


18 


3/4 


-0.733 


0.463 


Bifactorial 


28 


A 

A 




556 


423 


133 


3/4 


0.588 


0.557 


experiment 


29 


B amonj 


"4" 


423 


315 


108 


3/4 


-0.253 


0.801 


F 2 


30 


B amonj 


? a 


133 


101 


32 


3/4 


0.250 


0.802 


Trifactorial 


31 


A 




639 


480 


159 


3/4 


0.069 


0.945 


experiment 


32 


B anion; 


; "A" 


480 


367 


113 


3/4 


0.738 


0.461 


F 2 


33 


B amonj 


y "a" 


159 


122 


37 


3/4 


0.504 


0.615 




34 


C among 


AaBb 


175 


127 


48 


3/4 


-0.742 


0.458 




35 


C among 


AaBB 


70 


52 


18 


3/4 


-0.138 


0.890 




36 


C among 


AABb 


78 


60 


18 


3/4 


0.392 


0.695 




37 


C among 


AABB 


44 


30 


14 


3/4 


-1.045 


0.296 




38 


C among 


Aabb 


76 


60 


16 


3/4 


0.795 


0.427 




39 


C among 


AAbb 


37 


26 


11 


3/4 


-0.664 


0.506 




40 


C among 


aaBb 


79 


55 


24 


3/4 


-1.104 


0.269 




41 


C among 


aaBB 


43 


33 


10 


3/4 


0.264 


0.792 




42 


C among 


; aabb 


37 


30 


7 


3/4 


0.854 


0.393 


Single trait 


43 


A 




565 


372 


193 


2/3 


-0.417 


0.677 


(F 3 ) 


44 


B 




519 


353 


166 


2/3 


0.652 


0.515 




45 


C 




100 


64 


36 


2/3 


-0.566 


0.572 




46 


D 




100 


71 


29 


2/3 


0.919 


0.358 




47 


E 




100 


60 


40 


2/3 


-1.414 


0.157 




48 


F 




100 


67 


33 


2/3 


0.071 


0.944 




49 


G 




100 


72 


28 


2/3 


1.131 


0.258 




50 


E 




100 


65 


35 


2/3 


-0.354 


0.724 
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Table 7 
( Continued) 



Type of Observed 



PYTlPTlTtlPnt 
CiLU V;L LILIXjLHi 


No. 


Trait 




Til 


n ni 


Po 


X 


Ti- vnliip 


Bifactorial 


51 


A among U AB" 


301 


198 


103 


2/3 


-0.326 


0.744 


experiment 


52 


A among "A b" 


102 


67 


35 


2/3 


-0.210 


0.834 




53 


B among u aB" 


96 


68 


28 


2/3 


0.866 


0.386 




54 


B among Aa u B" 


198 


138 


60 


2/3 


0.905 


0.366 




55 


B among AA"B" 


103 


65 


38 


2/3 


-0.766 


0.443 


TVi +0 nri q 1 
_L I lleLC LU1 Idl 






■3D f 


245 


122 




n n^7 


fl Q71 


experiment 


57 


A among u Ab" 


113 


76 


37 


2/3 


0.133 


0.894 


(Fa) 


58 


B among 11 aB" 


122 


79 


43 


2/3 


-0.448 


0.654 




59 


B among Aa"B" 


245 


175 


70 


2/3 


1.581 


0.114 




60 


B among A A "5" 


122 


78 


44 


2/3 


-0.640 


0.522 




61 


C among AaBb 


127 


78 


49 


2/3 


-1.255 


0.210 




62 


C among AaBB 


52 


38 


14 


2/3 


0.981 


0.327 




63 


C among AABb 


60 


45 


15 


2/3 


1.369 


0.171 




64 


C among AABB 


30 


22 


8 


2/3 


0.775 


0.439 




65 


C among Aabb 


60 


40 


20 


2/3 


0.000 


1.000 




66 


C among A Abb 


26 


17 


9 


2/3 


-0.139 


0.890 




67 


C among aaBb 


55 


36 


19 


2/3 


-0.191 


0.849 




68 


C among aaBB 


33 


25 


8 


2/3 


1.108 


0.268 




69 


C among aabb 


30 


20 


10 


2/3 


0.000 


1.000 


vjdlllL Lit., 


70 




qn 




4.7 


1 19 


—0.422 


(\T\ 

U.U 1 o 


refine; 


71 


R 3 winner /4 4 


43 


20 


23 


1 12 


—0.458 


0.647 




72 


B among A a 


47 


25 


22 


1/2 


0.438 


0.662 




73 


A 


110 


57 


53 


1/2 


0.381 


0.703 




74 


B among A a 


57 


31 


26 


1/2 


0.662 


0.508 




75 


B among aa 


53 


27 


26 


1/2 


0.137 


0.891 




76 


A 


87 


44 


43 


1/2 


0.107 


0.915 




77 


B among A A 


44 


25 


19 


1/2 


0.905 


0.366 




78 


B among A a 


43 


22 


21 


1/2 


0.153 


0.879 




79 


A 


98 


49 


49 


1/2 


0.000 


1.000 




80 


B among A a 


49 


24 


25 


1/2 


-0.143 


0.886 




81 


B among aa 


49 


22 


27 


1/2 


-0.714 


0.475 




82 


G 


166 


87 


79 


1/2 


0.621 


0.535 




83 


C among Gg 


87 


47 


40 


1/2 


0.751 


0.453 




84 


C among gg 


79 


38 


41 


1/2 


-0.338 


0.736 



(the K-S statistic was larger than 0.1903 in 3807 
repetitions). This is statistically significantly larger 
than 0.0036; however, the difference is not mean- 
ingful from a practical point of view, the "exact" 
p-value is 3 digits accurate. So we concluded that it 
is acceptable to use the K-S test as described. 

There is another aspect which needs to be ana- 
lyzed. Because the outcomes of the experiments are 
binomial, yielding whole numbers, the actual distri- 
bution of the p- values is discrete, not uniform con- 
tinuous. Therefore, we decided to investigate the dif- 
ferences between the true distribution and the uni- 
form continuous. The exact distribution of the p- 
values obtained when the 84 chi-square tests are ap- 



plied to the binomial observations was determined 
in the following way. 

For a fixed experiment (with number of trials, n, 
and probability, p) we can list the n + 1 possible 
p-values along with the corresponding probabilities. 
For instance, in one of the experiments the num- 
ber of seeds (trials) is n = 35 and the true proba- 
bility of a round seed is 0.75 (under Mendel's the- 
ory, i.e., the null hypothesis). The possible values 
of round seeds observed in a repetition of this ex- 
periment are 0, 1, 2, . . . , 33, 34, 35 (y), each pro- 
ducing a possible value of the chi-square statistic 
( x 2 ( y ) = ( y - 35 x 0.75) 2 /(0.75 x 0.25 x 35)) and a 
corresponding p- value = P{x\ > X 2 (y)) with prob- 
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Table 8 

Illustration of the computations necessary to obtain the exact distribution of the p-values (n — 35, p = 0.75,) 



V 

x 2 (y) 

p- value 

P(y) 



o 

105.00 

1Q -24 

io- 21 



1 

97.15 

1Q -22 

io- 19 



25 
0.24 
0.626 
0.132 



26 
0.0095 
0.922 
0.152 



27 
0.086 
0.770 
0.152 



33 
6.94 
0.008 
0.003 



34 
9.15 
0.002 
0.0005 



35 

11.67 
0.0006 

io- 5 




FlG. 13. Equivalent to Figure 6 but showing the actual c.d.f. 
of the p-values under binomial sampling (black stair steps line 
close to the diagonal) and the c.d.f. of the maximum of two 
p-values (lower black stair steps line). 

ability given by P(y) = Cf x 0.75' y x 0.25 35 ~ y (see 
Table 8). 

Ordering the p-values and summing up the prob- 
abilities leads to the discrete c.d.f. defined by the 
points in Table 9. 

Proceeding similarly for all the 84 experiments 
and combining the lists P{y) multiplied by 1/84 
(i.e., the contribution of each experiment to the over- 
all distribution), we obtain the global probability 
function of the p-values, from which the final cu- 
mulative distribution is computed (overall there are 
14,218 distinct possible p-values, but from those the 
smallest 12,110 were not considered because their 
cumulative probability is smaller than 0.001). The 



result is shown in Figure 13. Although for some of 
the experiments, when considered individually, the 
c.d.f. of the p-values is quite different from that of 
the uniform (0, 1) distribution (like in the example 
above), when the 84 experiments are taken together 
the resulting c.d.f. of the p-values is very close to 
the straight line F(x) = x, which means that we can 
safely approximate this distribution by a continuous 
uniform distribution in (0, 1) and trust the results 
obtained with the K-S test (the approximation is 
not so good near the upper right corner, but this 
area is not relevant to this conclusion). The same 
remarks apply when the exact distribution of the 
maximum of two p-values is approximated by the 
curve y = x 2 (see also Figure 6). 

B.3 Estimation of the Parameter of Model A 

As explained in Section 5.1, we consider the esti- 
mate of a defined as the value of a which maximizes 
the p- value of the K-S test for testing the uniformity 
of the experimental p-values, denoted by p(a). The 
solution can be found by grid search, varying a in a 
finite set of equidistant points between and 1. With 
a grid width of 0.001, the value a = 0.201 was ob- 
tained. It is also possible (Easterling, 1976) to com- 
pute a 100 x 7% confidence interval for a by inver- 
sion of the K-S test. This confidence interval is the 
set of points a £ (0, 1) such that p-value(a) > 1 — 7 
(it may happen that this confidence set is empty, 
which is an indication that the model is not appro- 
priate). 

A simulation study was performed to validate this 
procedure. 1000 samples of 84 p-values were gen- 
erated from the 84 binomial experiments, but con- 
sidering the repetition mechanism of model A with 



Table 9 

The exact distribution of the p-values, when n = 35 and p = 0.75 



p-value 0.001 0.002 0.005 0.008 0.015 0.025 0.040 0.064 0.097 

c.d.f. 0.002 0.003 0.007 0.010 0.019 0.029 0.050 0.077 0.117 

p-value 0.143 0.205 0.283 0.380 0.495 0.626 0.770 0.922 

c.d.f. 0.173 0.240 0.334 0.434 0.564 0.696 0.848 1.000 
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a = 0.2. For each of those 1000 samples the point 
estimate and the 90% confidence interval for a were 
computed as described in the previous paragraph. 
The results of the simulation confirmed that the 
whole procedure is adequate and performs as ex- 
pected: the 1000 point estimates are distributed al- 
most symetrically with mean = 0.2077 (se = 0.0032), 
median = 0.194 and standard deviation = 0.101. The 
confidence set was empty in one case only. From 
the remaining 999 intervals (mean length = 0.3019, 
se = 0.0049; median length = 0.272), 895 contained 
the true value of a = 0.2, which gives an estimated 
confidence level of 89.5%, in close agreement with 
the specified 90% confidence. 

APPENDIX C: THE CHI-SQUARE ANALYSIS 
ASSUMING MODEL A 

The aim of this note is to show in detail why model 
A explains the chi-square analysis, and to derive the- 
oretically the approximate distribution of the global 
chi-square statistic which can be used to compute 
approximate p-values without the need to run sim- 
ulations. 

Fisher's chi-square analysis is based on the follow- 
ing simple reasoning: Let Xi, i = 1, . . . , 84, be the 
random variable describing the results of the ith ex- 
periment, that is, the number of observations among 
rii which are classified into a category of interest 
(which one of the two categories is the category of 
interest is not relevant). Let pi be the probability 
of an observation of that category in a single trial 
and pio the value of the same probability accord- 
ing to Mendel's theory. The standard model is Xi ~ 
Hm(rii,pi). If, furthermore, it is assumed, as Fisher 
did, that the Xi are independent and Pt^i-.pi = Pio 
is true for all i = 1, . . . , 84, it follows that 



,X 84 1 '~ d ' Bin(ni,pi ) 



Xi 



Qi 



Xi - riiPio 



y/riiPi (l - Pio 

(Xj - rupio) 2 i.i.d. 
niPio(l -Pm) 



~ a Xl 



,, 2 
J a A84- 



i=l 



We also have that E(Q T ) = 84 and var(Q T ) = 168 
[E(Qi) = 1 and var(Qj) = 2], and p-value = P(Qt > 
Qt observed) = P(Qt > 41.3376) ~ 0.99998. 



From Mendel's paper we already know that he 
performed other experiments than the 84 binomial 
experiments we have been considering. Let us as- 
sume that he has (or could have) done 2 x 84 = 168 
binomial experiments, such that for each of the re- 
ported 84 experiments there is a repetition (either 
actual or conceptual) and denote the repetition of 
Xi by Xi + s4 and the corresponding chi-square statis- 
tics by Qi and Qj+84- If for each pair (Xi, Xg4 + j) the 
selection of the reported experiment is random, then 
the observed statistics, denoted Q*, £ = 1, ...,84, 
are still i.i.d. xl an d Fisher's analysis remains valid. 
However, if the selection is not random, and is done 
according to our model A, we still have that (assum- 
ing, as Fisher did, that Xi are independent and Hof. 
Pi = Pio is true for all i = 1, ... , 168) 

Xi, . . .,X 168 1 ~ d ' Bm(rii,pio) 

Xl,---,Xl68 1 '~a^ r (0, 1) 
„ „ i.i.d. 2 

but each of the observed statistics, Q*, i = 1, . . . , 84, 
is no longer randomly chosen between Qi and Q«+84, 
in fact, they are chosen by the following rule, 



Qi 



Qi, if Qi ^ c Q , 

min(Qi, Qi +8 4), if Qi > c a , 



where c a is the 1 — a quantile of the xi distribution. 
Therefore, the Q* are i.i.d. but do not follow the 
Xi distribution, and, in consequence, = Yli=i Qt 
also does not follow the Xs4 distribution. 

The exact distribution of Q\ appears to be very 
difficult to derive; however, by the Central Limit 
Theorem (CLT), we can use a normal approxima- 
tion, 

Qt ~ a iV(84/i*,84a* 2 ), 

(C.l) 

with fi* = E(Q\) and a* 2 = var(Q*). 

Assuming that Qi ~ Xi > if i s possible to compute 
the mean and the variance of Q*, either directly or 
determining first the pdf of Q*, fq*. 

Given the reported value of the statistic, Q* ac- 
cording to model A, and the p-value computed us- 
ing the chi-square distribution, given by P = 1 — 
Fq.(Q*), with distribution function given by (5.1), 
we have that 

Fq* (x) = P{Q* <x) = P(P>l- F Qi (x)) 

= l-F P (l-F Qi (x)). 
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n=100, p: =0.45 n=200, Pi = 0.45 



n=100, Pi =0.4 




Table 10 

Mean and variance of Q* and p-values obtained using the 
normal approximation to and from the Monte Carlo 
simulation 











p- value 


p- value 


a 


C a 


M 




(normal approx.) 


(simulation) 


0.094 


2.805 


0.6636 


0.5685 


0.9814 


0.9860 


0.201 


1.635 


0.5160 


0.3662 


0.6412 


0.6577 


0.362 


0.831 


0.4164 


0.3135 


0.1076 


0.1176 



Taking derivatives on both sides yields 
dF P (u) 



,(*)■ 



du 



u=l-F Q .{x) 



f2/Q l (x)[l-F Qi (x)], ifx>c Q , 
1 (1 +a)fQ t (x), ifx<c Q , 



where fQ i (x) = e x ' 2 /y/2irx, x > 0, and FQ i (x) = 
JofQi{u)du. 

Using symbolic computation, we obtained ft* = 
1 - {2k a + (1 - a)y/2cjQ, a* 2 = 2- {Ak 2 a + (1 - 
a)^2c a k a (Ak a + l + c a ) + 2(2 + c a (2-2a + a 2 ))k a ), 
with k a = e~ Ca /ir. Table 10 gives the values of //* 
and a* 2 , as well as the p-values obtained using the 
normal approximation (C.l), for the three values of 
a considered previously. The p-values obtained in 
the simulation study (see Table 6) are also provided 
for comparison. The two columns of p-values are 
very similar. The results presented in this appendix 
are thus an independent validation of the simulation 
results, in case there was any doubt about them. 



APPENDIX D: MODEL A FOR AN 
INCORRECT THEORY 

Suppose that Mendel's theory was not right but 
that the same selection mechanism was applied (i.e., 
an experiment was repeated whenever its p-value 
was smaller than a, < a < 1, and then only the 
experiment with the largest p-value was reported). 
The difference between this case and that one con- 
sidered in Section 5.1 is that the original distribution 
of the p-values is not uniform (0, 1) but has a c.d.f. 
Fq(x) 7^ x for some < x < 1. Then, proceeding as in 
the proof of (5.1), we can conclude that the p-values 
effectively reported have a c.d.f. given by 

lFo(x)} 2 , 

if < x < a, 
[l + F (a)]F {x)-F (a) i 
if a < x < 1. 

The selection procedure would make an incorrect 
theory look correct if F*(x) is "close" to the c.d.f. 
of a uniform (0,1) random variable. The result de- 
pends on the starting point, Fq(x), which in turn 
depends on the particular test under analysis and 
on the true and hypothesized parameters, as the fol- 
lowing example shows. 

Suppose that the theory states that the success 
probability of a binomial random variable is po but 
that data are actually observed from a binomial ran- 
dom variable with success probability pi which may 
be different from po. Assuming that n is large, the 
normal approximation to the binomial leads to 

(D.l) F (x)= 



V 



>1 
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where <&(x) is the c.d.f. of a standard normal random 
variable, z = <I >_1 (1 — x/2), 

5= ^ Pl - P0) and ^ = M1Z^1. 
^npo(l-po) Po(l-Po) 

Note that, when po = Pit Fo(, x ) = x > as it should. 

Figure 14 shows the results for po = 1/2 and some 
values of n, p\ and a. We conclude that in the first 
case (n = 100, p± = 0.45) it is easy to make the the- 
ory look correct, but as n increases or p\ deviates 
from pq that becomes more difficult. 

There is, of course, the possibility of further gener- 
alizing model A by making more than 2 repetitions 
per experiment, say, k. With this extra flexibility it 
is easy to make any theory look correct. 
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